In this Kaggle competition we are asked to predict the forest cover type (the predominant kind of tree cover) from cartographic variables. The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
Elevation ------------------------ Elevation in meters
Slope ------------------------ Slope in degrees
Aspect ------------------------ Aspect in degrees azimuth
Horz_Hydro ------------------------ Horizontal distance to nearest surface water features
Vert_Hydro ------------------------ Vertical distance to nearest surface water features
Horz_Road ------------------------ Horizontal distance to nearest roadway
Hillshade_9am ------------------------ Hillshade index at 9am, summer solstice (0 to 255 index)
Hillshade_Noon ------------------------ Hillshade index at noon, summer solstice (0 to 255 index)
Hillshade_3pm ------------------------ Hillshade index at 3pm, summer solstice (0 to 255 index)
Horz_Fire ------------------------ Horizontal distance to nearest wildfire ignition points
Wilderness_Area ------------------------ Wilderness area designation (4 binary columns)
Soil_Type ------------------------ Soil type designation (40 binary columns)
Hillshading is a technique used to create a realistic view of terrain by creating a three-dimensional surface from a two-dimensional display of it. Hillshading creates a hypothetical illumination of a surface by setting a position for a light source and calculating an illumination value for each cell based on the cell's relative orientation to the light, or based on the slope and aspect of the cell.
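As a sketch of the calculation described above, the standard GIS hillshade formula combines the sun's position with each cell's slope and aspect. The function name and the default sun altitude/azimuth below are illustrative assumptions, not the USFS implementation:

```python
import math

def hillshade(slope_deg, aspect_deg, sun_altitude_deg=45.0, sun_azimuth_deg=315.0):
    """Illumination of a cell (0-255) from its slope/aspect and an assumed sun position."""
    zenith = math.radians(90.0 - sun_altitude_deg)   # angle from vertical to the sun
    azimuth = math.radians(sun_azimuth_deg)
    slope = math.radians(slope_deg)
    aspect = math.radians(aspect_deg)
    # Cosine of the angle between the surface normal and the sun direction
    shade = (math.cos(zenith) * math.cos(slope)
             + math.sin(zenith) * math.sin(slope) * math.cos(azimuth - aspect))
    return round(255 * max(shade, 0.0))  # cells facing away from the sun clamp to 0

print(hillshade(0, 0))     # flat terrain
print(hillshade(45, 315))  # slope facing directly into the sun -> fully lit (255)
```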
from IPython.display import Image
Image("/Users/tuktuk/Downloads/image001.gif")
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz
from IPython.display import Image
Image("/Users/tuktuk/Downloads/1*DjIccrMeRWmrC_mCUOGDhw.png")
1. Define the Problem
2. Gather the Data
3. Prepare Data for Consumption
4. Perform Exploratory Analysis
5. Model Data
6. Validate and Implement Data Model
7. Optimize and Strategize
Researchers at the Department of Forest Sciences at Colorado State University collected over half a million measurements from tree observations from four areas of the Roosevelt National Forest in Colorado. All observations are cartographic variables (no remote sensing) from 30-meter x 30-meter sections of forest.
The resulting dataset includes information on tree type, shadow coverage, distance to nearby landmarks (roads, etc.), soil type, and local topography. The dataset contains 581k entries with 54 attributes each (55 columns including the target). However, there are only 12 distinct features, because two of them (Wilderness Area and Soil Type) are represented as binary indicator vectors rather than single numeric columns, as after one-hot encoding.
Problem
Can we build a model that predicts what types of trees grow in an area based on the surrounding characteristics, such as elevation, slope, distances, and soil type?
Let's think about what could define cover type. One factor is definitely soil type. Others are proximity to water bodies and to fire ignition points. Elevation and sunlight also govern the growth of some trees, and slope plays a role in which types occur. Proximity to roadways may matter as well (an effect of pollution and human intervention?).
For this project, the problem statement is handed to us on a golden platter: build a model that predicts what types of trees grow in an area based on the surrounding characteristics, such as elevation, slope, distances, and soil type.
Data is downloaded from Kaggle. It consists of train, test, and submission datasets in CSV format.
""" https://www.kaggle.com/c/forest-cover-type-kernels-only/data"""
Since step 2 (gathering the data) was handled for us, the usual data-wrangling processes such as data architecture, governance, and extraction are out of scope; only data cleaning is in scope.
We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn libraries. Below are the common imports.
# NumPy
import numpy as np
# Dataframe operations
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# preprocessing functions
from sklearn.preprocessing import normalize
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Feature Selection
from sklearn import feature_selection
from sklearn.feature_selection import VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.feature_selection import RFECV
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Scalers
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
# splitting in train and test
from sklearn.model_selection import train_test_split
#sklearn and other model libraries
import sklearn
from sklearn.feature_selection import SelectFromModel
from sklearn import linear_model
from sklearn import tree
from sklearn import neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
# Models
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier,VotingClassifier,AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
# Cross-validation
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.model_selection import cross_validate
# For pipeline
from sklearn.pipeline import Pipeline
# Checking Accuracy
from sklearn import metrics #accuracy measure
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_curve
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error
# model selection and Optimising models
from sklearn import model_selection
#RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# GridSearchCV
from sklearn.model_selection import GridSearchCV
#Common Model Helpers
from sklearn import metrics
import random
import time
train = pd.read_csv(r"/Users/tuktuk/Downloads/forest-cover-type-kernels-only/train.csv.zip")
test = pd.read_csv(r"/Users/tuktuk/Downloads/forest-cover-type-kernels-only/test.csv.zip")
data = [train, test] # for operations on both datasets
print(train.shape, test.shape)
# the test data is almost 4 times the size of the train data
train.head()
test.sample(5)
The dataset has 55 columns in total, where Wilderness_Area consists of 4 dummy variables and Soil_Type consists of 40 dummy variables.
Continuous Data
Elevation: elevation in meters
Aspect: aspect in degrees azimuth
Slope: slope in degrees
Horizontal_Distance_To_Hydrology: horizontal distance to nearest surface water features, in meters
Vertical_Distance_To_Hydrology: vertical distance to nearest surface water features, in meters
Horizontal_Distance_To_Roadways: horizontal distance to nearest roadway, in meters
Horizontal_Distance_To_Fire_Points: horizontal distance to nearest wildfire ignition points, in meters
Hillshade_9am: hillshade index at 9am, summer solstice (value out of 255)
Hillshade_Noon: hillshade index at noon, summer solstice (value out of 255)
Hillshade_3pm: hillshade index at 3pm, summer solstice (value out of 255)
Categorical Data
Wilderness Area (4 dummy variable binary columns, 0 = absence or 1 = presence)
Soil Type (40 dummy variable binary columns, 0 = absence or 1 = presence)
The target variable Cover_Type is defined as an integer value between 1 and 7, with the following key:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz
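For readability during analysis, the key above can be kept as a small lookup; the helper below is hypothetical (not part of the dataset), and the dictionary simply restates the list:

```python
# Hypothetical helper: map the integer Cover_Type label to its species name.
COVER_TYPES = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}

def cover_name(code: int) -> str:
    """Return the species name for a Cover_Type code (1-7)."""
    return COVER_TYPES[code]

print(cover_name(7))  # Krummholz
```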
train.info()
test.info()
train.describe()
#list of discrete variables
discrete_vars = [var for var in train.columns if len(train[var].unique())<20 and var not in ['Id', 'Cover_Type']]
print('Discrete variables ({}): {}'.format(len(discrete_vars), discrete_vars))
# list of continuous variables
cont_vars = [var for var in train.columns if var not in discrete_vars+['Id', 'Cover_Type']]
print('Continuous variables ({}): {}'.format(len(cont_vars), cont_vars))
for var in discrete_vars:
    print(var, train[var].unique())
for var in cont_vars:
    print(var, train[var].min(), '-', train[var].max())
train.isnull().sum()
test.isnull().sum()
cont_vars
plt.figure(figsize=(16,10))
plt.subplot(241)
sns.boxplot(train['Elevation'],showmeans = True, meanline = True)
plt.title('Train Elevation')
plt.subplot(242)
sns.boxplot(test['Elevation'],showmeans = True, meanline = True)
plt.title('Test Elevation')
plt.subplot(243)
sns.boxplot(train['Aspect'],showmeans = True, meanline = True)
plt.title('Train Aspect')
plt.subplot(244)
sns.boxplot(test['Aspect'],showmeans = True, meanline = True)
plt.title('Test Aspect')
plt.subplot(245)
sns.boxplot(train['Slope'],showmeans = True, meanline = True)
plt.title('Train Slope')
plt.subplot(246)
sns.boxplot(test['Slope'],showmeans = True, meanline = True)
plt.title('Test Slope')
plt.subplot(247)
sns.boxplot(train['Horizontal_Distance_To_Hydrology'],showmeans = True, meanline = True)
plt.title('Train Horiz_Hydro')
plt.subplot(248)
sns.boxplot(test['Horizontal_Distance_To_Hydrology'],showmeans = True, meanline = True)
plt.title('Test Horiz_Hydro');
plt.figure(figsize=(16,10))
plt.subplot(241)
sns.boxplot(train['Vertical_Distance_To_Hydrology'],showmeans = True, meanline = True)
plt.title('Tr_Vertical_Hydrology')
plt.subplot(242)
sns.boxplot(test['Vertical_Distance_To_Hydrology'],showmeans = True, meanline = True)
plt.title('TestVertical_Hydrology')
plt.subplot(243)
sns.boxplot(train['Horizontal_Distance_To_Roadways'],showmeans = True, meanline = True)
plt.title('Train_Horizontal_Roadways')
plt.subplot(244)
sns.boxplot(test['Horizontal_Distance_To_Roadways'],showmeans = True, meanline = True)
plt.title('Test_Horizontal_Roadways')
plt.subplot(245)
sns.boxplot(train['Hillshade_9am'],showmeans = True, meanline = True)
plt.title('Trainshade_9am')
plt.subplot(246)
sns.boxplot(test['Hillshade_9am'],showmeans = True, meanline = True)
plt.title('Testshade_9am')
plt.subplot(247)
sns.boxplot(train['Hillshade_Noon'],showmeans = True, meanline = True)
plt.title('Trainshade_Noon')
plt.subplot(248)
sns.boxplot(test['Hillshade_Noon'],showmeans = True, meanline = True);
plt.title('Testshade_Noon');
plt.figure(figsize=(16,5))
plt.subplot(141)
sns.boxplot(train['Hillshade_3pm'],showmeans = True, meanline = True)
plt.title('Trainshade_3pm')
plt.subplot(142)
sns.boxplot(test['Hillshade_3pm'],showmeans = True, meanline = True)
plt.title('Testshade_3pm')
plt.subplot(143)
sns.boxplot(train['Horizontal_Distance_To_Fire_Points'],showmeans = True, meanline = True)
plt.title('Train Horiz_Fire')
plt.subplot(144)
sns.boxplot(test['Horizontal_Distance_To_Fire_Points'],showmeans = True, meanline = True)
plt.title('Test Horiz_Fire');
list1 = train['Elevation'].tolist()
list1
array1 = np.array(list1)
array1
first = np.percentile(array1, 25)
first
def outlier_function(df, col_name):
    '''Detect the first and third quartiles and the interquartile range (IQR)
    for a given column of a dataframe, then calculate conservative upper and
    lower limits (3 * IQR) to flag outliers.
    Returns the lower limit, the upper limit, and the number of outliers.
    '''
    first_quartile = np.percentile(df[col_name], 25)
    third_quartile = np.percentile(df[col_name], 75)
    IQR = third_quartile - first_quartile
    upper_limit = third_quartile + (3 * IQR)
    lower_limit = first_quartile - (3 * IQR)
    outlier_count = 0
    for value in df[col_name]:
        if (value < lower_limit) | (value > upper_limit):
            outlier_count += 1
    return lower_limit, upper_limit, outlier_count
# loop through all columns to see if there are any outliers
for dataset in data:
    for column in cont_vars:
        outlier_count = outlier_function(dataset, column)[2]
        if outlier_count > 0:
            print("There are {} outliers in {}".format(outlier_count, column))
Outliers were found in four columns:
Horizontal_Distance_To_Hydrology
Vertical_Distance_To_Hydrology
Horizontal_Distance_To_Roadways
Horizontal_Distance_To_Fire_Points
Recall the data ranges of those 4 columns:
Horizontal_Distance_To_Hydrology: 0, 1343
Vertical_Distance_To_Hydrology: -146, 554
Horizontal_Distance_To_Roadways: 0, 6890
Horizontal_Distance_To_Fire_Points: 0, 6993
Horizontal_Distance_To_Fire_Points has the highest number of outliers and the widest data range.
# Removing Horizontal_Distance_To_Fire_Points outliers
trees = train[(train['Horizontal_Distance_To_Fire_Points'] > outlier_function(train, 'Horizontal_Distance_To_Fire_Points')[0]) &
(train['Horizontal_Distance_To_Fire_Points'] < outlier_function(train, 'Horizontal_Distance_To_Fire_Points')[1])]
trees.shape
train.shape
train['Cover_Type'].unique()
print(train['Cover_Type'].value_counts())
# Training set is chosen in such a fashion that each class has the same number of observations.
# Target Label Visualisation
plt.figure(figsize = (12,6))
sns.countplot(x= 'Cover_Type', data= train)
print(train['Cover_Type'].value_counts())
# Visualisation of continuous features
cont_vars
numerical_features = ['Elevation','Aspect','Slope','Horizontal_Distance_To_Hydrology',
'Vertical_Distance_To_Hydrology','Horizontal_Distance_To_Roadways','Hillshade_9am',
'Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points']
for feature_name in numerical_features:
plt.figure()
sns.distplot(train[feature_name], label='train')
sns.distplot(test[feature_name], label='test')
plt.legend()
plt.show()
# Lets see their relation with the target label
print("Density Plot of Elevation for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Elevation"][train.Cover_Type == 1])
sns.kdeplot(train["Elevation"][train.Cover_Type == 2])
sns.kdeplot(train["Elevation"][train.Cover_Type == 3])
sns.kdeplot(train["Elevation"][train.Cover_Type == 4])
sns.kdeplot(train["Elevation"][train.Cover_Type == 5])
sns.kdeplot(train["Elevation"][train.Cover_Type == 6])
sns.kdeplot(train["Elevation"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Elevation for Forest cover types')
plt.show();
# It can be seen that elevation is related to the cover type
# Lets see for 'Aspect and 'Slope'
print("Density Plot of Aspect for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Aspect"][train.Cover_Type == 1])
sns.kdeplot(train["Aspect"][train.Cover_Type == 2])
sns.kdeplot(train["Aspect"][train.Cover_Type == 3])
sns.kdeplot(train["Aspect"][train.Cover_Type == 4])
sns.kdeplot(train["Aspect"][train.Cover_Type == 5])
sns.kdeplot(train["Aspect"][train.Cover_Type == 6])
sns.kdeplot(train["Aspect"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Aspect for Forest cover types')
plt.show();
# It can be seen that distribution of 4 types is similar
# Lets check for slope
print("Density Plot of Slope for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Slope"][train.Cover_Type == 1])
sns.kdeplot(train["Slope"][train.Cover_Type == 2])
sns.kdeplot(train["Slope"][train.Cover_Type == 3])
sns.kdeplot(train["Slope"][train.Cover_Type == 4])
sns.kdeplot(train["Slope"][train.Cover_Type == 5])
sns.kdeplot(train["Slope"][train.Cover_Type == 6])
sns.kdeplot(train["Slope"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Slope for Forest cover types')
plt.show();
# clearly 2 sets of forest types are seen
# Lets check for 'Horizontal_Distance_To_Hydrology'
print("Density Plot of 'Horizontal_Distance_To_Hydrology' for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 1])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 2])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 3])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 4])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 5])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 6])
sns.kdeplot(train["Horizontal_Distance_To_Hydrology"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Horizontal_Distance_To_Hydrology for Forest cover types')
plt.show();
# Except for Type 4, distribution is similar
# Lets check for 'Vertical_Distance_To_Hydrology'
print("Density Plot of Vertical_Distance_To_Hydrology for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 1])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 2])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 3])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 4])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 5])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 6])
sns.kdeplot(train["Vertical_Distance_To_Hydrology"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Vertical_Distance_To_Hydrology for Forest cover types')
plt.show();
# Vertical_Distance_To_Hydrology for all types has similar distribution with different peaks
# Lets check for 'Horizontal_Distance_To_Roadways'
print("Density Plot of Horizontal_Distance_To_Roadways for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 1])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 2])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 3])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 4])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 5])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 6])
sns.kdeplot(train["Horizontal_Distance_To_Roadways"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Horizontal_Distance_To_Roadways for Forest cover types')
plt.show();
# distribution shows that Horizontal_Distance_To_Roadways is related to the forest cover types
# Lets check for 'Hillshade_9am'
print("Density Plot of Hillshade_9am for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 1])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 2])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 3])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 4])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 5])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 6])
sns.kdeplot(train["Hillshade_9am"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Hillshade_9am for Forest cover types')
plt.show();
# Except for Type 6, distribution is similar
# Lets check for Hillshade_Noon
print("Density Plot of Hillshade_Noon for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 1])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 2])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 3])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 4])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 5])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 6])
sns.kdeplot(train["Hillshade_Noon"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Hillshade_Noon for Forest cover types')
plt.show();
# Distribution looks similar, so no major factor
# Lets check for Hillshade_3pm
print("Density Plot of Hillshade_3pm for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 1])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 2])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 3])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 4])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 5])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 6])
sns.kdeplot(train["Hillshade_3pm"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Hillshade_3pm for Forest cover types')
plt.show();
# Except for type 4 and type 6, distribution is similar
# Lets check for 'Horizontal_Distance_To_Fire_Points'
print("Density Plot of Horizontal_Distance_To_Fire_Points for Forest cover types")
plt.figure(figsize=(15,8))
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 1])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 2])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 3])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 4])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 5])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 6])
sns.kdeplot(train["Horizontal_Distance_To_Fire_Points"][train.Cover_Type == 7])
plt.legend(['Cover Type1', 'Cover Type2', 'Cover Type3', 'Cover Type4','Cover Type5',
'Cover Type6','Cover Type7'])
plt.title('Density Plot of Horizontal_Distance_To_Fire_Points for Forest cover types')
plt.show();
sns.lmplot(x= "Horizontal_Distance_To_Hydrology", y= "Vertical_Distance_To_Hydrology" , hue = 'Cover_Type',
data= train, fit_reg = False, height = 8);
plt.show()
sns.lmplot(x= 'Elevation', y= 'Aspect' , hue = 'Cover_Type', data= train, fit_reg = False, height = 8);
plt.show()
sns.lmplot(x= 'Elevation', y= 'Slope' , hue = 'Cover_Type', data= train, fit_reg = False, height = 8);
plt.show()
sns.lmplot(x= 'Slope', y= 'Aspect' , hue = 'Cover_Type', data= train, fit_reg = False, height = 8);
sns.lmplot(x= 'Horizontal_Distance_To_Hydrology', y= 'Horizontal_Distance_To_Fire_Points' , hue = 'Cover_Type',
data= train, fit_reg = False, height = 8);
sns.lmplot(x= 'Horizontal_Distance_To_Roadways', y= 'Horizontal_Distance_To_Fire_Points' , hue = 'Cover_Type',
data= train, fit_reg = False, height = 8);
# convert dummies into a single column
# convert wilderness
wild_dummies = train[['Wilderness_Area1','Wilderness_Area2','Wilderness_Area3','Wilderness_Area4']]
wild = wild_dummies.idxmax(axis=1)
wild.name = 'wilderness'
# convert soil
soil_dummies = train[train.columns[15:55]]
soil = soil_dummies.idxmax(axis=1)
soil.name = 'soil_type'
df_train = pd.concat((train['Cover_Type'], wild, soil), axis =1)
#df_train
# test data
wild_dummies1 = test[['Wilderness_Area1','Wilderness_Area2','Wilderness_Area3','Wilderness_Area4']]
wild1 = wild_dummies1.idxmax(axis=1)
wild1.name = 'wilderness'
#wild1
test.columns[15:55]
soil_dummies1 = test[test.columns[15:55]]
soil1 = soil_dummies1.idxmax(axis=1)
soil1.name = 'soil_type'
#soil1
df_test = pd.concat((test['Id'],wild1, soil1), axis =1)
plt.figure(figsize = (12,6))
sns.countplot(x='wilderness',data=df_train);
plt.show()
plt.figure(figsize = (12,6))
sns.countplot(x='wilderness',data=df_test);
plt.show()
# this is the ratio of Wilderness_Area
plt.figure(figsize=(24, 12))
plt.subplot(121)
train.filter(regex="Wilderness").sum(axis=0).plot(kind='pie')
plt.title(' Train')
plt.subplot(122)
test.filter(regex="Wilderness").sum(axis=0).plot(kind='pie')
plt.title('test')
plt.show()
plt.figure(figsize=(12,6))
sns.countplot(x='wilderness',data=df_train, hue='Cover_Type');
plt.xticks(rotation=45);
Spruce/Fir, Lodgepole Pine, and Krummholz (Cover_Type 1, 2, 7) are mostly found in the Rawah, Neota, and Comanche Peak wilderness areas (1, 2, and 3).
Ponderosa Pine (Cover_Type 3) is far more likely to be found in the Cache la Poudre wilderness area (4) than in the other areas.
Cottonwood/Willow (Cover_Type 4) seems to be found only in the Cache la Poudre wilderness area (4).
Aspen (Cover_Type 5) is equally likely to come from the Rawah and Comanche Peak wilderness areas (1 and 3).
Douglas-fir (Cover_Type 6) is found only in wilderness areas 3 and 4.
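Observations like these can also be read directly off a contingency table instead of a count plot. The sketch below uses a tiny made-up frame whose column names mirror df_train; the data is illustrative only:

```python
import pandas as pd

# Toy stand-in for df_train: one row per observed cell (made-up values).
toy = pd.DataFrame({
    "wilderness": ["Wilderness_Area1", "Wilderness_Area4", "Wilderness_Area4",
                   "Wilderness_Area3", "Wilderness_Area1"],
    "Cover_Type": [1, 4, 3, 5, 2],
})

# Rows = wilderness area, columns = cover type, cells = observation counts.
table = pd.crosstab(toy["wilderness"], toy["Cover_Type"])
print(table)
```

On the real df_train, `pd.crosstab(df_train['wilderness'], df_train['Cover_Type'])` would give the exact counts behind the bar chart above.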
plt.figure(figsize=(12,6))
sns.countplot(x='Cover_Type',data=df_train, hue='wilderness');
plt.xticks(rotation=45);
plt.scatter(x="Cover_Type", y = "wilderness", data = df_train)
# Distribution of Soil_Type in train data
plt.figure(figsize=(12,5))
sns.countplot(x='soil_type',data=df_train);
plt.xticks(rotation=90);
# Distribution of Soil_Type in test data
plt.figure(figsize=(12,5))
sns.countplot(x='soil_type',data=df_test);
plt.xticks(rotation=90);
# this is the ratio of soil type
plt.figure(figsize=(24, 12))
plt.subplot(121)
train.filter(regex="Soil").sum(axis=0).plot(kind='pie')
plt.title('Soil Types in Train')
plt.subplot(122)
test.filter(regex="Soil").sum(axis=0).plot(kind='pie')
plt.title('Soil Types in Test')
plt.show()
plt.figure(figsize=(16,16))
sns.countplot(x='soil_type',data=df_train, hue='Cover_Type');
plt.xticks(rotation=45)
plt.legend(loc = 'upper right');
plt.figure(figsize=(16,16))
sns.countplot(x='Cover_Type',data=df_train, hue='soil_type');
plt.xticks(rotation=45)
plt.legend(loc = 'upper right');
# select features which are continuous
cont_features = train[['Elevation', 'Aspect', 'Slope',
'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon',
'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points','Cover_Type']]
cmap = sns.color_palette("Set2")
fig, axes = plt.subplots(ncols=2, nrows=5, figsize=(12, 24))
a = [ax for row in axes for ax in row]  # flatten the 5x2 grid of axes
for i, ax in enumerate(a):
    sns.boxplot(x='Cover_Type', y=cont_features.columns[i], data=cont_features, palette=cmap, width=0.5, ax=ax)
# rotate x-axis labels for every subplot
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
# rotate y-axis labels for every subplot
for ax in fig.axes:
    plt.sca(ax)
    plt.yticks(rotation=45)
# set spacing for every subplot, else the x-axis labels will be covered
plt.tight_layout()
corr = train.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(corr, cmap=sns.color_palette("RdBu_r", 20));
train = trees
data = (train, test)
for dataset in data:
    # Euclidean (straight-line) distance to the nearest surface water feature
    dataset['Slope2'] = (dataset['Horizontal_Distance_To_Hydrology']**2 + dataset['Vertical_Distance_To_Hydrology']**2)**0.5
for dataset in data:
    dataset['Hydro_FireSum'] = dataset['Horizontal_Distance_To_Hydrology'] + dataset['Horizontal_Distance_To_Fire_Points']
    dataset['Hydro_FireDiff'] = dataset['Horizontal_Distance_To_Hydrology'] - dataset['Horizontal_Distance_To_Fire_Points']
for dataset in data:
    dataset['Hydro_RoadSum'] = dataset['Horizontal_Distance_To_Hydrology'] + dataset['Horizontal_Distance_To_Roadways']
    dataset['Hydro_RoadDiff'] = dataset['Horizontal_Distance_To_Hydrology'] - dataset['Horizontal_Distance_To_Roadways']
for dataset in data:
    dataset['Fire_RoadSum'] = dataset['Horizontal_Distance_To_Fire_Points'] + dataset['Horizontal_Distance_To_Roadways']
    dataset['Fire_RoadDiff'] = dataset['Horizontal_Distance_To_Fire_Points'] - dataset['Horizontal_Distance_To_Roadways']
for dataset in data:
    dataset['Hydro_Elevation_Vert_Sum'] = dataset['Elevation'] + dataset['Vertical_Distance_To_Hydrology']
    dataset['Hydro_Elevation_Vert_Diff'] = dataset['Elevation'] - dataset['Vertical_Distance_To_Hydrology']
print(train.shape, test.shape)
train.info()
# Make copies of train and test before feature selection, so we can restore them later
train_before_FS = train.copy(deep = True)
test_before_FS = test.copy(deep = True)
# Feature selection by Correlation
corr2 = train.corr()
corr2['Cover_Type'].sort_values(ascending=False)
fig, ax = plt.subplots()
fig.set_size_inches(11,11)
sns.heatmap(corr2, cmap='magma');
def correlation(dataset, threshold):
    '''Return the set of column names with an absolute pairwise correlation above threshold.'''
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr
corr_features1 = correlation(train, 0.9)
corr_features1
# As Soil_Type7 and Soil_Type15 have no correlation with Cover_Type, let's drop them along with Id
data1 = (train, test)
for dataset in data1:
    dataset.drop(columns=['Id', 'Soil_Type7', 'Soil_Type15'], inplace=True)
train.columns
# making X (Independent variables) and y(dependent variable)
X =train.drop('Cover_Type' , axis =1)
y =train['Cover_Type']
# splitting in train and test
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.05, random_state=42 )
unique, count = np.unique(y_train, return_counts=True)
print("Number of occurrences of each class in the dataset = %s" % dict(zip(unique, count)))
Let's now see whether the new features we added carry any significance for the Extra Trees model, and how important each of our features is. We can check this because the Extra Trees algorithm estimates feature usefulness internally, exposed via feature_importances_.
clf = ExtraTreesClassifier()
clf.fit(X_train,y_train)
# display the relative importance of each attribute
z = clf.feature_importances_
#make a dataframe to display every value and its column name
df = pd.DataFrame()
print(len(z))
print(len(list(X_train.columns.values)))
df["values"] = z
df['column'] = list(X_train.columns.values)
# Sort in descending order so the least important features appear at the end
df.sort_values(by='values', ascending=False, inplace = True)
df.head(100)
#### We can drop Soil_Type25 and Soil_Type8, and check accuracy
X_train1 = X_train.copy(deep=True)
X_test1 = X_test.copy(deep=True)
X_train1.drop(['Soil_Type25', 'Soil_Type8'], axis =1, inplace = True)
X_test1.drop(['Soil_Type25', 'Soil_Type8'], axis =1, inplace =True)
# Lets check accuracy on X_train and X_train1 after standard scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
scaler = StandardScaler()
scaler.fit(X_train1)
X_train1 = scaler.transform(X_train1)
X_test1 = scaler.transform(X_test1)
clf = ExtraTreesClassifier(n_estimators=950, random_state=0)
clf.fit(X_train, y_train)
print('Accuracy of classifier on training set: {:.2f}'.format(clf.score(X_train, y_train) * 100))
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X_test, y_test) * 100))
clf = ExtraTreesClassifier(n_estimators=950, random_state=0)
clf.fit(X_train1, y_train)
print('Accuracy of classifier on training set: {:.2f}'.format(clf.score(X_train1, y_train) * 100))
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X_test1, y_test) * 100))
df_train['soil_type'].value_counts()
df_test['soil_type'].value_counts()
len(train_before_FS.columns), len(test_before_FS.columns)
#### Dropping features as suggested by Feature Selection
train = train_before_FS.drop(columns=['Id'])
test = test_before_FS.drop(columns=['Id'])
train.shape, test.shape
corr3 = train.corr()
corr3['Cover_Type'].sort_values(ascending=False)
# making X (independent variables) and y (dependent variable)
X = train.drop('Cover_Type', axis=1)
y = train['Cover_Type']
# splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
#Ensemble Methods
AdaBoostClassifier(),
ensemble.BaggingClassifier(),
ExtraTreesClassifier(),
RandomForestClassifier(),
KNeighborsClassifier(),
#Trees
tree.DecisionTreeClassifier()
]
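For reference, the scikit-learn names used in this section (both the module-level `ensemble.`/`tree.` references and the bare class names) assume imports along these lines were run earlier in the notebook; the exact import cell is not shown here, and plotting additionally uses `matplotlib.pyplot` as `plt` and `seaborn` as `sns`:

```python
# Sketch of the sklearn imports this section relies on (assumed, not shown above).
import numpy as np
import pandas as pd
from sklearn import ensemble, tree, model_selection
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     GridSearchCV)
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
```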
# Let's check the accuracy and fit time of all the above classification models on both train and test data
#create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters', 'Train Accuracy Mean', 'Train Accuracy *STD' ,'Test Accuracy','MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
#index through MLA and save performance to table
row_index = 0
for alg in MLA:
    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    # score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
    cv_results = model_selection.cross_validate(alg, X_train, y_train, cv=10)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'Train Accuracy Mean'] = cv_results['test_score'].mean()
    # if this is a non-biased random sample, then +/- 3 standard deviations (std) from the mean should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'Train Accuracy *STD'] = cv_results['test_score'].std()  # let's know the worst that can happen!
    # save MLA predictions
    alg.fit(X_train, y_train)
    MLA_compare.loc[row_index, 'Test Accuracy'] = accuracy_score(y_test, alg.predict(X_test))
    row_index += 1
#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by = ['Train Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
# Visualisation of the same with barplot
plt.figure(figsize=(24, 24))
sns.barplot(x='Train Accuracy Mean', y='MLA Name', data=MLA_compare, color='m')
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Train Accuracy Score (%)')
plt.ylabel('Algorithm')
plt.show()
sns.barplot(x='Test Accuracy', y = 'MLA Name', data = MLA_compare, color = 'b');
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Test Accuracy Score (%)')
plt.ylabel('Algorithm')
plt.show()
# Let's use the function below to report accuracy, precision, and recall
def print_score(clf, X_train, y_train):
    '''
    Print the accuracy score, classification report, and confusion matrix of a classifier.
    '''
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_train)  # cache predictions instead of re-predicting per metric
    print("Train Result:\n")
    print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, y_pred)))
    print("Classification Report: \n {}\n".format(classification_report(y_train, y_pred)))
    print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, y_pred)))
    res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
    print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
MLAs = [ExtraTreesClassifier(),RandomForestClassifier()]
for clf in MLAs:
    print(clf)
    print('-' * 20)
    print_score(clf, X_train, y_train)
    print('-' * 20)
%%time
# Let's use GridSearchCV with a pipeline on the entire train set (X, y)
xrf_pipe = Pipeline(
    steps=[
        ('classifier', ExtraTreesClassifier(n_estimators=500, random_state=0, criterion='entropy'))
    ]
)
xrf_param_grid = {
    'classifier__min_samples_leaf': [1, 4, 7],
    'classifier__max_depth': [32, 34, 38],
}
np.random.seed(1)
xrf_grid_search = GridSearchCV(xrf_pipe, xrf_param_grid, cv=5, refit=True, n_jobs=-1)
xrf_grid_search.fit(X, y)
print(xrf_grid_search.best_score_)
print(xrf_grid_search.best_params_)
xrf_model = xrf_grid_search.best_estimator_
cv_score = cross_val_score(xrf_model, X, y, cv = 5)
print(cv_score)
print("Accuracy: %0.2f (+/- %0.2f)" % (cv_score.mean(), cv_score.std() * 2))
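The mean +/- 2*std report above approximates a 95% interval over the cross-validation folds. A self-contained illustration of the same pattern on a toy dataset (Iris, not the competition data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
clf_demo = ExtraTreesClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold; the spread across folds drives the interval.
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```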
xrf_grid_search.best_estimator_.steps[0][1]
final_model = xrf_grid_search.best_estimator_.steps[0][1]
final_model.fit(X, y)
# Let's check accuracy on Kaggle's test data by submitting predictions as a submission file
y_pred_test = final_model.predict(test)
temp = pd.DataFrame(pd.read_csv("/Users/tuktuk/Downloads/forest-cover-type-kernels-only/test.csv.zip")['Id'])
temp['Cover_Type'] = y_pred_test
temp.to_csv("/Users/tuktuk/Downloads/forest-cover-type-kernels-only/submission14Sep4.csv", index = False)
from collections import Counter
Counter(y_pred_test)
Image("/Users/tuktuk/Desktop/Screenshot 2020-09-15 at 3.58.52 PM.png")